Optimization and validation of 18F-DCFPyL PET radiomics-based machine learning models in intermediate- to high-risk primary prostate cancer

Introduction Radiomics extracted from prostate-specific membrane antigen (PSMA)-PET modeled with machine learning (ML) may be used for prediction of disease risk. However, validation of previously proposed approaches is lacking. We aimed to optimize and validate ML models based on 18F-DCFPyL-PET radiomics for the prediction of lymph-node involvement (LNI), extracapsular extension (ECE), and postoperative Gleason score (GS) in primary prostate cancer (PCa) patients. Methods Patients with intermediate- to high-risk PCa who underwent 18F-DCFPyL-PET/CT before radical prostatectomy with pelvic lymph-node dissection were evaluated. The training dataset included 72 patients, the internal validation dataset 24 patients, and the external validation dataset 27 patients. PSMA-avid intra-prostatic lesions were delineated semi-automatically on PET and 480 radiomics features were extracted. Conventional PET-metrics were derived for comparative analysis. Segmentation, preprocessing, and ML methods were optimized in repeated 5-fold cross-validation (CV) on the training dataset. The trained models were tested on the combined validation dataset. Combat harmonization was applied to external radiomics data. Model performance was assessed using the area under the receiver operating characteristic curve (AUC). Results The CV-AUCs in the training dataset were 0.88, 0.79 and 0.84 for LNI, ECE, and GS, respectively. In the combined validation dataset, the ML models could significantly predict GS with an AUC of 0.78 (p<0.05). However, validation AUCs for LNI and ECE prediction were not significant (0.57 and 0.63, respectively). Conventional PET metrics-based models had comparable AUCs for LNI (0.59, p>0.05) and ECE (0.66, p>0.05), but a lower AUC for GS (0.73, p<0.05). In general, Combat harmonization improved external validation AUCs (-0.03 to +0.18).
Conclusion In internal and external validation, 18F-DCFPyL-PET radiomics-based ML models predicted high postoperative GS but not LNI or ECE in intermediate- to high-risk PCa. Therefore, the clinical benefit seems to be limited. These results underline the need for external and/or multicenter validation of PET radiomics-based ML model analyses to assess their generalizability.


Introduction
Risk stratification of primary prostate cancer (PCa) is important for determining prognosis and selecting optimal treatment strategies. Currently, this is based on clinical tumor stage (cT-stage), prostate-specific antigen (PSA) serum level, and the International Society of Urologic Pathology (ISUP) score in prostate biopsies [1]. These data are typically included in preoperative nomograms in patients without distant metastatic disease on conventional imaging (i.e., bone scintigraphy and computed tomography (CT)) and, more recently, prostate-specific membrane antigen (PSMA) positron emission tomography (PET) [2,3]. These nomograms calculate the risk of lymph node involvement (LNI) and determine if extended pelvic lymph node dissection (ePLND) is indicated. However, ePLND remains the gold standard for determining lymph node status [1].
In recent years, modern imaging modalities, such as multiparametric magnetic resonance imaging (mpMRI) and PSMA-PET/CT, have been implemented in PCa care. In parallel with the growth of artificial intelligence (AI) in medical imaging, this has triggered many investigations into AI-based image analysis for tumor staging and risk classification in primary PCa. One such approach is the use of machine learning (ML) models applied to radiomics data from prostate MRI or PSMA-PET/CT [4-7]. In radiomics, high-dimensional data are extracted from radiological images. ML algorithms can be trained to transform these radiomics data into clinically applicable predictions [8,9].
In a previous analysis, we performed a machine learning-based analysis of 18F-DCFPyL (an 18F-labeled PSMA radioligand) PET/CT radiomics in patients with intermediate- to high-risk PCa scheduled for robot-assisted radical prostatectomy (RARP) with ePLND to predict LNI and high-risk tumor features, and observed excellent cross-validated predictive values [10]. However, this approach has not yet been validated to confirm its generalizability.
A known limitation of radiomics analyses is the often limited reproducibility caused by sensitivity to differences in imaging procedures (e.g., acquisition, reconstruction, segmentation) and factors related to radiomics calculation [9]. Therefore, in this study, we optimized and validated machine learning models based on 18F-DCFPyL PET radiomics, including Combat harmonization to compensate for possible center effects, in patients with intermediate- to high-risk primary PCa scheduled for RARP with ePLND [10].

Patients
Patients with newly diagnosed, biopsy-proven intermediate- to high-risk PCa who underwent 18F-DCFPyL PET/CT imaging prior to RARP and ePLND were included. ePLND was performed if there were high-risk factors (i.e., serum PSA level >20 ng/mL, ISUP score ≥4, clinical or radiological tumor stage ≥3) or a nomogram-based risk of LNI of 8% or greater [1]. Patients underwent 18F-DCFPyL PET/CT at two centers in the Netherlands (Amsterdam University Medical Center (Amsterdam UMC) and the Northwest Clinics (NWZ) Alkmaar) between April 2017 and June 2020. Baseline clinical and pathology data were registered (i.e., PSA level, clinical tumor stage, biopsy ISUP score, and the percentage of positive prostate biopsies).

Training and validation datasets
We used a training dataset (n = 72/76, excluding 4 patients with distant metastases from the previously published internal cross-validation (CV) [10]) from Amsterdam UMC to optimize methodologies and train the machine learning models. The models were applied to internal (n = 24) and external (n = 27) validation datasets from Amsterdam UMC and NWZ, respectively. This resulted in a combined validation dataset of 51 patients.

Preoperative clinical and pathology data for baseline models
The radiomics-based machine learning models were compared to clinically used methods (henceforth referred to as baseline models) to assess their added value. The Memorial Sloan Kettering Cancer Center (MSKCC) nomogram was applied to calculate the risk of LNI and ECE and the corresponding AUCs [2]. The AUC of the biopsy Gleason score (GS) baseline model was calculated by using the biopsy GS as input.

Postoperative pathology data
The surgical tissue specimens of the prostate and lymph nodes were examined according to the international guidelines of uropathologists for primary tumor and nodal staging [1]. The three selected reference outcomes for analysis were dichotomized, being postoperative GS (<8 vs. ≥8), presence of ECE (≤pT2b vs. ≥pT3a) and LNI (pN0 vs. pN1).

Image acquisition and interpretation
PET input data (i.e., dose, calibration, injection, and scan times) were collected. At the Amsterdam UMC, patients were scanned on a time-of-flight PET/CT-scanner (Ingenuity or Vereos, Philips Healthcare) with European Association of Nuclear Medicine (EANM) Research Ltd. (EARL) accreditation [11]. Whole-body PET-images were obtained from mid-thigh to skull base at 4 minutes per bed position at 119 minutes (interquartile range (IQR) 116-125) post-injection (p.i.) of a median dose of 308 MBq (IQR 293-318) 18F-DCFPyL. Data were reconstructed using the vendor-provided iterative reconstruction algorithm using 3 iterations and 33 subsets with 4 mm isotropic voxels (EARL-compliant). PET-images were combined with a non-contrast-enhanced low-dose CT-scan (30-110 mAs at 120 kV).
PET-images were corrected for scatter, decay, randoms, and attenuation. In addition to the originally reconstructed images, the Lucy-Richardson iterative deconvolution was applied for partial volume correction (PVC) [12].
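To make the deconvolution step more concrete, the following is a minimal one-dimensional sketch of the Lucy-Richardson iteration in NumPy. It is an illustration only and not the implementation used in this study: the Gaussian point spread function, its assumed width (`fwhm_vox`), the number of iterations, the choice of the observed profile as starting estimate, and the function names are all assumptions made for the example.

```python
import numpy as np

def gaussian_psf(fwhm_vox, radius=6):
    """Normalized 1D Gaussian kernel; fwhm_vox is an assumed resolution in voxels."""
    sigma = fwhm_vox / 2.3548  # FWHM = 2*sqrt(2*ln 2) * sigma
    x = np.arange(-radius, radius + 1)
    psf = np.exp(-0.5 * (x / sigma) ** 2)
    return psf / psf.sum()

def richardson_lucy_1d(observed, psf, n_iter=20, eps=1e-12):
    """Multiplicative Lucy-Richardson update on a 1D activity profile."""
    observed = np.asarray(observed, dtype=float)
    estimate = observed.copy()  # starting estimate (an illustrative choice)
    psf_flipped = psf[::-1]
    for _ in range(n_iter):
        blurred = np.convolve(estimate, psf, mode="same")
        ratio = observed / (blurred + eps)  # eps guards against division by zero
        estimate = estimate * np.convolve(ratio, psf_flipped, mode="same")
    return estimate

# Synthetic example: a small "lesion" blurred by the assumed scanner resolution.
truth = np.zeros(41)
truth[18:23] = 10.0
psf = gaussian_psf(fwhm_vox=4.0)
observed = np.convolve(truth, psf, mode="same")
recovered = richardson_lucy_1d(observed, psf)
```

In this noise-free toy case the recovered profile has a higher peak than the blurred one, which is the partial volume effect the correction is meant to counteract. On real, noisy PET data the number of iterations would need to be limited to avoid noise amplification.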

Tumor segmentation
The latest version of the ACCURATE tool was used for tumor segmentation on PET-images [13]. A mask was manually drawn around the PSMA-avid prostatic lesion on PET to prevent the inclusion of activity from other structures, such as the urinary bladder or rectum. The masks were reviewed by a second observer. If necessary, the masks were jointly adjusted. Subsequently, the tumor was delineated semi-automatically using a region-growing isocontour corresponding to 50%, 55%, 60%, 65%, and 70% of the peak Standardized Uptake Value (SUV) with or without correction for local background uptake. The original and PVC images were delineated separately.
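The region-growing isocontour can be sketched as follows. This is a simplified illustration under stated assumptions, not the ACCURATE implementation: the background-corrected threshold formula (background plus a fraction of the peak-to-background difference), the use of the maximum voxel as a stand-in for the peak SUV, and the function names are all hypothetical choices made for this example.

```python
from collections import deque

import numpy as np

def isocontour_threshold(suv_peak, fraction, background=0.0):
    """Absolute SUV cut-off; background=0.0 means no background correction."""
    return background + fraction * (suv_peak - background)

def region_grow(suv, seed, threshold, mask=None):
    """Grow a face-connected region from `seed` over voxels >= threshold."""
    allowed = suv >= threshold
    if mask is not None:
        allowed &= mask  # restrict growth to the manually drawn mask
    region = np.zeros(suv.shape, dtype=bool)
    if not allowed[seed]:
        return region
    region[seed] = True
    queue = deque([seed])
    while queue:
        idx = queue.popleft()
        for axis in range(suv.ndim):
            for step in (-1, 1):
                nb = list(idx)
                nb[axis] += step
                nb = tuple(nb)
                inside = 0 <= nb[axis] < suv.shape[axis]
                if inside and allowed[nb] and not region[nb]:
                    region[nb] = True
                    queue.append(nb)
    return region

# Toy 2D "image": one lesion plus a separate hot spot (e.g., bladder activity).
suv = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 6.2, 9.0, 1.0, 1.0],
    [1.0, 7.0, 10.0, 1.0, 8.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
])
seed = tuple(int(i) for i in np.unravel_index(np.argmax(suv), suv.shape))
suv_peak = float(suv.max())  # stand-in for SUVpeak (assumption)

region = region_grow(suv, seed, isocontour_threshold(suv_peak, 0.60))
region_bg = region_grow(suv, seed, isocontour_threshold(suv_peak, 0.60, background=1.0))
```

In the toy image the 60% contour contains four voxels, the background-corrected contour contains three, and the disconnected hot spot is excluded in both cases because it is not connected to the seed.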

Radiomics feature extraction
Radiomics features were calculated from the segmented region using the Image Biomarker Standardization Initiative (IBSI)-compliant RaCat software [14,15]. Voxels were resampled to 2 mm isotropic voxels using trilinear interpolation with aligned edges [16,17]. Voxels were scaled to the SUV. Before calculating textural features, discretization was applied with a fixed bin width of 0.25 SUV units. A total of 480 radiomics features on texture (n = 408), intensity (n = 50) and morphology (n = 22) were extracted per patient. Additionally, conventional PET metrics (e.g., SUV values, volume, and total PSMA uptake) were derived for comparative analysis.
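As an illustration of fixed-bin-width discretization, the short sketch below maps SUV values to integer bin numbers using a bin width of 0.25 SUV. It is not the RaCat code: the lower bound of 0 SUV for the first bin and the function name are assumptions made for this example.

```python
import numpy as np

def discretize_fixed_bin_width(suv_values, bin_width=0.25, minimum=0.0):
    """Map SUV values to 1-based bin numbers with a fixed bin width."""
    suv_values = np.asarray(suv_values, dtype=float)
    return np.floor((suv_values - minimum) / bin_width).astype(int) + 1

bins = discretize_fixed_bin_width([0.10, 0.25, 0.60, 2.49, 2.50])
print(bins.tolist())  # [1, 2, 3, 10, 11]
```

With a fixed bin width, the same SUV always falls into the same bin regardless of the lesion's intensity range, which keeps texture features comparable between patients.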

Machine learning
Random Forest (RF) and Logistic Regression (LR) were used as machine learning classifiers. For data normalization, we applied i) z-score standardization and ii) Yeo-Johnson transformation [18]. Dimension reduction was performed using i) Principal Component Analysis, ii) Recursive Feature Elimination, iii) univariate feature selection, and iv) the Least Absolute Shrinkage and Selection Operator [19]. The Synthetic Minority Oversampling Technique (SMOTE) was applied to correct data imbalance [20]. To mitigate the center effect, Combat harmonization (using the neuroCombat implementation) was applied to external radiomics data, using the training data as reference dataset [21,22]. Also, the impact of filtering features (based on linear correlations) before dimension reduction was evaluated. Lastly, the influence of adding clinical parameters (initial PSA and biopsy ISUP score) to the data after radiomics feature selection was evaluated.
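The idea of harmonizing external features toward a reference dataset can be illustrated with a deliberately simplified location-scale alignment. This sketch is not Combat and not the neuroCombat implementation used in this study: it omits the empirical Bayes estimation of batch effects and any covariate adjustment, and the function name `align_to_reference` and the synthetic data are hypothetical.

```python
import numpy as np

def align_to_reference(external, reference, eps=1e-12):
    """Shift and scale each external feature to the reference mean and SD."""
    external = np.asarray(external, dtype=float)
    reference = np.asarray(reference, dtype=float)
    ext_mean = external.mean(axis=0)
    ext_std = external.std(axis=0, ddof=1)
    ref_mean = reference.mean(axis=0)
    ref_std = reference.std(axis=0, ddof=1)
    z = (external - ext_mean) / (ext_std + eps)  # per-feature z-score
    return z * ref_std + ref_mean

# Synthetic features: rows are patients, columns are radiomics features.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(72, 3))  # training center
external = rng.normal(loc=2.0, scale=3.0, size=(27, 3))   # shifted, rescaled center
harmonized = align_to_reference(external, reference)
```

After alignment, each external feature has the same mean and standard deviation as in the reference data. Combat additionally pools information across features when estimating the center effect, which matters for small batches such as the external dataset here.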
The optimal combination of tumor segmentation settings, use of PVC, and machine learning methods (data normalization, dimension reduction, oversampling) was selected using 10-times repeated 5-fold CV on the training dataset. Model hyperparameters were optimized in nested cross-validation. The models with optimal machine learning configurations were then trained on the entire training dataset and tested on the validation datasets (settings in Table 1).
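A repeated 5-fold CV with nested hyperparameter tuning can be sketched with scikit-learn as shown below. This is a minimal example under stated assumptions rather than the study's pipeline: the data are synthetic, the grid over the number of selected features, the three inner folds, and the number of trees are arbitrary, and Yeo-Johnson transformation and SMOTE are left out for brevity. The function name `nested_cv_auc` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import (
    GridSearchCV,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def nested_cv_auc(X, y, n_repeats=10, seed=0):
    """Outer repeated 5-fold CV for the AUC; inner CV tunes the pipeline."""
    pipeline = Pipeline([
        ("scale", StandardScaler()),                    # z-score standardization
        ("select", SelectKBest(score_func=f_classif)),  # univariate selection
        ("clf", RandomForestClassifier(n_estimators=25, random_state=seed)),
    ])
    param_grid = {"select__k": [5, 20]}  # illustrative grid (assumption)
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner)
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)

# Synthetic stand-in for 72 patients with 480 radiomics features.
X, y = make_classification(
    n_samples=72, n_features=480, n_informative=10,
    weights=[0.7, 0.3], random_state=0,
)
scores = nested_cv_auc(X, y, n_repeats=2)  # 2 repeats keep the demo fast
cv_auc = scores.mean()
```

Placing scaling and feature selection inside the pipeline means they are refitted on the training part of every fold, so no information from the held-out fold leaks into the model. Any oversampling step would need the same treatment.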

Statistical analysis
The area under the receiver operating characteristic (ROC) curve (AUC), sensitivity, and specificity were calculated to assess model performance. In CV, the mean AUC (CV-AUC) was calculated. In each CV iteration, the probability threshold at the Youden index was determined. The mean probability threshold was then used to calculate sensitivity and specificity in the validation datasets. The DeLong test was performed to compare the AUCs of different models applied to the validation data [23]. Machine learning analyses were executed in Python 3.7 using the scikit-learn package [24]. Statistical analysis was performed using SPSS version 28 and GraphPad Prism version 9. Statistical significance was defined as p<0.05.
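The AUC and the Youden-index threshold can be illustrated on a small toy example. The sketch below uses plain NumPy and is not the code used in this study: the function names and the toy labels and scores are assumptions made for the example, and ties in the Youden index are resolved by keeping the lowest threshold.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a positive case outscores a negative one."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[y_true][:, None] - scores[~y_true][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def sens_spec(y_true, scores, threshold):
    """Sensitivity and specificity when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(scores, dtype=float) >= threshold
    sensitivity = (pred & y_true).sum() / y_true.sum()
    specificity = (~pred & ~y_true).sum() / (~y_true).sum()
    return float(sensitivity), float(specificity)

def youden_threshold(y_true, scores):
    """Threshold maximizing J = sensitivity + specificity - 1."""
    best_j, best_thr = -1.0, None
    for thr in np.unique(scores):
        sensitivity, specificity = sens_spec(y_true, scores, thr)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_j, best_thr = j, float(thr)
    return best_thr, best_j

# Toy predicted probabilities for 4 negative and 4 positive cases.
y = [0, 0, 0, 1, 0, 1, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

auc = roc_auc(y, p)                   # 15 of 16 pairs ranked correctly: 0.9375
thr, j = youden_threshold(y, p)       # threshold 0.4 with J = 0.75
sens, spec = sens_spec(y, p, thr)     # sensitivity 1.0, specificity 0.75
```

In the procedure described above, such a threshold would be determined in every CV iteration, and the mean of these thresholds would then be passed to a function like `sens_spec` for the validation data.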

Model validation
The highest CV-AUCs in the training dataset were 0.88 for LNI, 0.79 for ECE, and 0.84 for GS (all using the Random Forest classifier; see Table 1 for configurations). The radiomics models outperformed the conventional metrics models in CV (LNI 0.78, ECE 0.72, and GS 0.80). Compared to the CV-AUCs, in the combined validation dataset the radiomics-based machine learning models yielded lower non-significant AUCs for LNI (0.57) and ECE (0.63), but a significant AUC of 0.78 for GS (p<0.05). In the combined validation dataset, the models based on conventional PET metrics had similar AUCs for LNI (0.59, p>0.05) and ECE (0.66, p>0.05), and a lower, but significant AUC for GS (0.73, p<0.05). No significant differences were found between the radiomics- and conventional PET metrics-based AUCs of the combined validation dataset. The AUCs, sensitivity and specificity of the radiomics-based machine

Impact of combat harmonization
The effect of Combat harmonization on the performance of the radiomics-based machine learning models in the external validation dataset for LNI, ECE, and GS predictions is illustrated in Fig 2. In general, Combat harmonization improved the external validation AUCs, with changes ranging from -0.03 to +0.18.

Impact of adding clinical data
Adding clinical parameters (initial PSA and biopsy ISUP score) to the radiomics data after feature selection did not improve model performance for LNI, ECE, or GS prediction for the combined validation dataset (Fig 3).

Comparison with baseline models
The AUC of the MSKCC-nomogram for predicting LNI was 0.68 (p>0.05) in the training dataset and 0.51 (p>0.05) in the combined validation dataset. The AUC of the MSKCC-

Discussion
In this study, we optimized and validated the performance of a previously proposed machine learning approach using 18F-DCFPyL PET radiomics for predicting high-risk tumor features in patients with intermediate- to high-risk PCa in a multicenter dataset. In this validation, we found that 18F-DCFPyL PET radiomics could predict high postoperative GS but not LNI and ECE. Therefore, the clinical benefit seems to be limited. Still, this implies that radiomics-based machine learning models could enhance preoperative GS determination in clinical practice and potentially help clinicians in their daily decision-making process. The discrepancy between positive results from internal cross-validation and negative results from multicenter validation emphasizes the importance of such validation studies for radiomics analyses.
Newly diagnosed patients with PCa are typically stratified into three risk groups (i.e., low-, intermediate-, and high-risk PCa) based on PSA level, clinical T-stage, and biopsy GS [1]. Accurate prediction of the GS is essential for treatment guidance, to avoid over- or undertreatment. Not all patients undergo radical prostatectomy, so the GS obtained with biopsy is usually used in daily practice despite the known discordance between biopsy and prostatectomy GS [25,26]. However, since the introduction of targeted prostate biopsies, pathological upgrading at prostatectomy has decreased significantly. A recent systematic review and meta-analysis by Goel et al. evaluated the concordance between systematic or MRI-targeted prostate biopsies and prostatectomy GS. They found that pathologic upgrading at prostatectomy was significantly less frequent with targeted biopsy (23.3%) compared to systematic biopsy alone (42.7%), without significant differences in pathological downgrading between the two biopsy techniques [26]. In our present study, the radiomics-based models predicted postoperative GS with an AUC of 0.78, and the AUC of our baseline model was 0.85. This high AUC of our baseline model might be explained by the improved accuracy of (targeted) prostate biopsies to determine the GS correctly. However, biopsies are subsamples of the prostate, making them subject to sampling errors, and minor complications (i.e., bleeding or infection) are frequent. Therefore, radiomics-based machine learning models could be a helpful non-invasive tool to assist in GS determination in patients with primary PCa.
Several studies have assessed the value of PSMA-PET radiomics to predict GS in PCa [27-31]. However, multicenter validation is often lacking [32]. Zamboglou et al. reported a similar AUC for GS discrimination (GS 7 vs. ≥8) with 68Ga-PSMA-PET radiomics in their internal validation cohort [31]. Papp et al. found that radiomics combined with machine learning could discriminate between low- and high-risk PCa lesions on 68Ga-PSMA-PET/MR [28]. The use of 68Ga-PSMA-PET/MR radiomics was also investigated by Solari et al., who reported an AUC of 0.75 for predicting postsurgical GS with their PET radiomics model. Their method differed from ours in that they segmented the whole prostate gland and trained their model to predict three instead of two GS categories (GS <8, 8, and >8) [29]. Recently, Aksu et al. evaluated whether 68Ga-PSMA-PET/CT radiomics (without machine learning) could predict GS ≥8 and found an AUC of 0.90 [27]. Using 18F-PSMA-1007 PET radiomics, Yao et al. found a similar predictive performance for GS characterization (training AUC 0.82; testing AUC 0.80) [30]. In summary, our results are consistent with previous research, suggesting that PSMA-PET radiomics could be of added value for the preoperative prediction of postoperative GS.
In our training dataset, we found high cross-validation prediction scores for each outcome, whereas in the validation dataset, the radiomics-based models could not predict LNI and ECE (AUCs of 0.57 and 0.63). The same applied to the models using conventional PET metrics as input, implying that the issue is not merely related to the robustness of texture-based radiomics features. Also, two baseline models (MSKCC nomogram predictions for LNI and ECE) reached higher or equal AUCs in the training dataset compared to the validation cohort.
Taken together, this may indicate that the low prediction AUCs in the (combined internal and external) validation datasets are due to inherent characteristics of the validation dataset rather than merely to overfitted or biased models. In contrast, the biopsy baseline model for predicting postoperative GS reached a higher AUC in the validation dataset than in the training dataset. This prediction is based only on the biopsy GS, which relies on various factors such as the biopsy technique (systematic versus targeted), the number of biopsies taken, and the physician's experience performing the procedure.
As is well known, radiomics features (especially those based on texture) are sensitive to variations in PET systems, image acquisition protocols, reconstruction settings, post-processing, and segmentation methods. As the purpose of this study was a multicenter validation, the validation dataset consisted of patients from two different institutions with differences in PET systems, reconstruction software, and imaging protocols. A radiomics-based machine learning model trained with data from one center may not apply directly to data from another. Therefore, to minimize the influence of a center effect, Combat harmonization was applied to harmonize the radiomics features. Indeed, this improved our validation AUCs of the external dataset. To date, the use of Combat harmonization in PCa radiomics analyses is limited. However, the Combat harmonization method has been used in several contexts, and its efficacy in PET imaging has been demonstrated [22,33-35].
Our study is not devoid of limitations. First, most patients had high-risk disease according to the EAU guidelines, and our results should also be validated in patient cohorts with lower-risk disease. Secondly, we only selected patients who had undergone radical prostatectomy as we chose postoperative GS as one of the reference outcomes. However, patients undergoing radiotherapy instead of radical prostatectomy are not expected to differ much from the current population. Thirdly, the anatomical locations of the prostate tumors delineated on PET were not directly compared to the prostatectomy specimen. Potentially, less PSMA-avid lesions on PET may be missed in, for example, multifocal tumors. In clinical practice, however, anatomical information from the prostatectomy specimen is, logically, not available at pre-operative imaging. We chose to delineate the most PSMA-avid foci within the prostate based on semi-automatic lesion segmentation. This could impact the radiomics feature values and, consequently, our results. Still, our results are in line with other data where whole-prostate radiomics were derived [29]. Also, semi-automatic segmentation analyses suffer less from observer variability than manual segmentations [36]. In future prospective research, additional information on disease recurrence during follow-up may help evaluate the utility of the machine learning algorithm in treating patients with PCa.

Conclusion
The 18F-DCFPyL radiomics-based machine learning models could predict high postoperative GS but not LNI or ECE. Still, these results are in line with the baseline models. This implies that radiomics-based machine learning models could enhance preoperative GS determination in clinical practice and potentially help clinicians in their daily decision-making process. Furthermore, these results underline the need for external and/or multicenter validation of PET radiomics-based machine learning model analyses to investigate their reproducibility and clinical applicability.

Fig 4. ROC-curves of the baseline models in the training dataset and the combined validation dataset. The pre-radical prostatectomy MSKCC-nomogram for lymph node involvement (LNI; A) and extracapsular extension (ECE; B) prediction and the biopsy baseline model for postoperative Gleason score (GS; C) prediction.
https://doi.org/10.1371/journal.pone.0293672.g004

Table 1. The optimal machine learning configuration based on the cross-validated AUCs of the training dataset.
AUC = area under the curve; LNI = lymph node involvement; ECE = extracapsular extension; GS = Gleason score; RF = random forest; LR = logistic regression; VOI = volume of interest; PVC = partial volume correction; RFE = Recursive Feature Elimination; LASSO = Least Absolute Shrinkage and Selection Operator; SMOTE = synthetic minority over-sampling technique.
*threshold defined as percentage of SUVpeak
https://doi.org/10.1371/journal.pone.0293672.t001

learning models are listed in Table 4. The ROC-curves for these analyses are given in Fig 1. The AUCs of the different machine learning classifiers with/without Combat harmonization or the addition of clinical parameters for predicting LNI, ECE, and GS are summarized in S1 Table.